TECHNICAL REPORT
R-381 

February  2012 


Controlling  Selection  Bias  in  Causal  Inference 


Elias  Bareinboim 

Cognitive  Systems  Laboratory 
Department  of  Computer  Science 
University  of  California,  Los  Angeles 
Los  Angeles,  CA.  90095 

eb@cs.ucla.edu 

Abstract 

Selection bias, caused by preferential exclusion of samples from the
data, is a major obstacle to valid causal and statistical inferences;
it cannot be removed by randomized experiments and can hardly be
detected in either experimental or observational studies. This paper
highlights several graphical and algebraic methods capable of
mitigating and sometimes eliminating this bias. These nonparametric
methods generalize previously reported results, and identify the type
of knowledge that is needed for reasoning in the presence of selection
bias. Specifically, we derive a general condition together with a
procedure for deciding recoverability of the odds ratio (OR) from
s-biased data. We show that recoverability is feasible if and only if
our condition holds. We further offer a new method of controlling
selection bias using instrumental variables that permits the recovery
of other effect measures besides OR.

1  Introduction 

Selection bias is induced by preferential selection of units for data
analysis, usually governed by unknown factors including treatment,
outcome and their consequences. Case-control studies in Epidemiology
are particularly susceptible to such bias, e.g., cases may be reported
only when the outcome (disease or complication) is unusual, while
non-cases remain unreported (see Glymour and Greenland, 2008; Robins
et al., 2000; Robins, 2001; Hernan et al., 2004).

Appearing in Proceedings of the 15th International Conference on
Artificial Intelligence and Statistics (AISTATS) 2012, La Palma,
Canary Islands. Volume 22 of JMLR: W&CP 22. Copyright 2012 by the
authors.

To illuminate the nature of this bias, consider the model of Fig. 1(a)
in which S is a variable affected by both X (treatment) and Y
(outcome), indicating entry into the data pool. Such preferential
selection to the pool amounts to conditioning on S, which creates
spurious association between X and Y through two mechanisms. First,
conditioning on S induces spurious association between its parents, X
and Y. Second, S is also a descendant of a "virtual collider" Y, whose
parents are X and the error term U_Y (also called "omitted factors" or
"hidden variable"), which is always present, though often not shown in
the diagram.1
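
The first mechanism can be illustrated with a small exact calculation. The sketch below is not part of the original analysis: all probabilities are hypothetical, X and Y are made independent in the population so that any association seen after selection is spurious, and the odds ratio is computed as the usual cross-product ratio of a 2x2 table.

```python
from itertools import product

def odds_ratio(table):
    """Cross-product odds ratio of a 2x2 table of weights table[(x, y)]."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])

# Hypothetical population in which X and Y are independent.
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.7, 1: 0.3}
# Hypothetical selection mechanism P(S=1 | x, y) with X -> S <- Y:
# both treatment and outcome change the chance of entering the data pool.
p_s = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.6}

population = {(x, y): p_x[x] * p_y[y] for x, y in product((0, 1), repeat=2)}
# Unnormalised P(x, y | S=1); the constant P(S=1) cancels in the odds ratio.
selected = {xy: population[xy] * p_s[xy] for xy in population}

or_population = odds_ratio(population)
or_selected = odds_ratio(selected)
print(or_population, or_selected)
```

With these numbers the population odds ratio equals one, while the odds ratio among selected units does not, even though X has no effect on Y.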

A medical example of selection bias was reported in (Horwitz and
Feinstein, 1978), and subsequently studied in (Hernan et al., 2004;
Geneletti et al., 2009), in which it was noticed that the effect of
Oestrogen (X) on Endometrial Cancer (Y) was overestimated in the data
studied. One of the symptoms of the use of Oestrogen is vaginal
bleeding (W) (Fig. 1(c)), and the hypothesis was that women noticing
bleeding are more likely to visit their doctors, causing women using
Oestrogen to be overrepresented in the study.

In causal inference studies, the two most common sources of bias are
confounding (Fig. 1(b)) and selection (Fig. 1(a)). The former is a
result of treatment X and outcome Y being affected by common omitted
variables U, while the latter is due to treatment or outcome (or its
descendants) affecting the inclusion of the subject in the sample
(indexed by S). In both cases, we have unblocked extraneous "flow" of
influence between treatment and outcome, which appears under the
rubric of "spurious correlation." It is called spurious because it is
not part of what we seek to estimate, namely the causal effect of X on
Y in the target population. In the case of confounding, bias occurs
because we cannot condition on the unmeasured confounders, while in
selection, the distribution is always conditioned on S.


1See  (Pearl,  2009,  pp.  339-341)  for  further  explanation 
of  this  bias  mechanism. 

Formally, the distinction between these biases can be articulated
thus: confounding bias is any X-Y association that is attributable to
selective choice of treatment, while selection bias is any association
attributable to selective inclusion in the data pool. Operationally,
confounding bias can be eliminated by randomization; selection bias
cannot. Given this distinction, the two biases deserve different
qualitative treatment and entail different properties, which we
explore in this paper. Remarkably, there are special cases in which
selection bias can be detected even from observations, as in the form
of a non-chordal undirected component (Zhang, 2008).

As an interesting corollary of this distinction, it was shown (Pearl,
2010) that confounding bias, if such exists, can be amplified by
conditioning on an instrumental variable Z (Fig. 1(d)). Selection
bias, on the other hand, remains invariant under such conditioning.

We will use instrumental variables for the removal of selection bias
in the presence of confounding bias, as shown in the scenario of
Fig. 1(f). Whereas instrumental variables cannot ensure nonparametric
identification of average causal effects, they can help provide
reasonable bounds on those effects as well as point estimates in some
special cases (Balke and Pearl, 1997). Since the bounding analysis
assumed no selection bias, the question arises whether similar bounds
can be derived in the presence of selection bias. We will show that
selection bias can be removed entirely through the use of instrumental
variables; therefore, the bounds on the causal effect will be narrowed
to those obtained under the selection-free assumption.

This result is relevant in many areas because selection bias is
pervasive in almost all empirical studies, including Machine Learning,
Statistics, Social Sciences, Economics, Bioinformatics, Biostatistics,
Epidemiology, Medicine, etc. For instance, one version of selection
bias was studied in Economics, and led to the celebrated method
developed by (Heckman, 1970). It removes the bias through a two-step
process which assumes linearity, normality, and a probabilistic model
of the selection mechanism.

Machine learning tasks suffer from a similar problem when training
samples are selected preferentially, depending on feature-class
combinations that differ from those encountered in the target
environment (Zadrozny, 2004; Smith and Elkan, 2007; Storkey, 2009;
Hein, 2009).

In Epidemiology, the prevailing approach is due to James Robins
(Robins et al., 2000; Hernan et al., 2004), which assumes knowledge of
the probability of selection given treatment. In some special cases,
this probability can be estimated from data, requiring a record, for
each treatment given, of whether a follow-up outcome (Y) is reported
or not. We do not rely on such knowledge in this paper but assume,
instead, that no data on treatment or outcome are available unless a
case is reported (via S).

Figure 1: Different scenarios considered in this paper. (a, b)
Simplest examples of selection and confounding bias, respectively.
(c) Typical study with intermediary variable W between X and
selection. (d) Instrumental variable with selection bias. (e)
Selection combined with confounding. (f) Instrumental variable with
confounding and selection bias simultaneously present.

Contributions 

Our contributions are as follows. In Section 2, we give a complete
graphical condition under which the population odds ratio (OR) and a
covariate-specific causal odds ratio can be recovered from
selection-biased data (Theorem 1). We then devise an effective
procedure for testing this condition (Theorems 2 and 3). These
results, although motivated by causal considerations, are applicable
to classification tasks as well, since the process of eliminating
selection bias is separated from that of controlling for confounding
bias.

In Section 3, we present universal curves that show the behavior of OR
as the distribution P(y | x) changes, and how the risk ratio (RR) and
risk difference (RD) are related to OR. We further show that if one is
interested in recovering RR and RD under selection bias, knowledge of
P(X) is sufficient for recovery.

In Section 4, we move on to measures of effect other than the odds
ratio, and show that even when confounding and selection biases are
simultaneously present (Fig. 1(e)), the latter can be entirely removed
with the help of instrumental variables (Theorem 4). This result is
surprising for two reasons: first, we generally do not expect
selection bias to be removable; second, bias removal in the presence
of confounding is generally expected to be a more challenging task. We
finally show how this result is applicable to scenarios where other
structural assumptions hold, for instance, when an instrument is not
available but a certain back-door admissible set can be identified
(Corollary 4).

2  Selection bias in a chain structure and its graphical generalizations

The chain structure of Figure 2(a) is the simplest structure
exhibiting selection bias. The intuition gained from analyzing this
example will serve as a basis for subsequently treating more
complicated structures.

Consider a study of the effect of a training program (X) on earnings
5 years after completion (Y), and assume that there is no confounding
between treatment and outcome. Assume that subjects achieving higher
income tend to report their status more frequently than those with
lower income. The qualitative causal assumptions are depicted in
Fig. 2(a). Given that all available data is obtained under selection
bias, is the unbiased odds ratio recoverable?

To address this problem, we explicitly add a variable S to represent
the selection mechanism, and assume that S = 1 represents presence in
the sample, and zero otherwise. We will refer to samples selected by
such a mechanism as "s-biased". A similar representation was used in
(Cooper, 1995; Lauritzen and Richardson, 2008; Geneletti et al., 2009;
Didelez et al., 2010). In the chain structure of Fig. 2(a), X is
d-separated from S by Y, which implies the conditional independence
(X ⊥⊥ S | Y), and encodes the assumption that entry to the data pool
is determined by the outcome Y only, not by X. We next define some key
concepts used throughout the paper and state some results that will
support our analysis.

Definition 1 (Odds ratio). Consider two variables X and Y and a set Z.
The conditional odds ratio OR(Y, X | Z = z) is given by the ratio:
(Pr(y | z, x') / Pr(y' | z, x')) / (Pr(y | z, x) / Pr(y' | z, x)).

OR(Y, X | Z) measures the strength of association between X and Y
conditioned on Z and it is symmetric, i.e., OR(Y, X | Z) =
OR(X, Y | Z).
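
A direct transcription of Definition 1 may help fix the notation. The following is a minimal sketch for binary X and Y: the joint distribution is hypothetical, the coding x = 1, x' = 0 (and likewise for y) is our assumption, and joint probabilities are used in place of conditionals because the normalising constants cancel. Swapping the roles of X and Y checks the symmetry property numerically.

```python
def conditional_odds_ratio(joint, z):
    """OR(Y, X | Z = z) following Definition 1, for binary X and Y.

    joint maps (x, y, z) to P(x, y, z); the value 1 stands for x (or y)
    and 0 for x' (or y').
    """
    odds_y_given_x_prime = joint[(0, 1, z)] / joint[(0, 0, z)]
    odds_y_given_x = joint[(1, 1, z)] / joint[(1, 0, z)]
    return odds_y_given_x_prime / odds_y_given_x

def swap_xy(joint):
    """Same distribution with the roles of X and Y exchanged."""
    return {(y, x, z): p for (x, y, z), p in joint.items()}

# Hypothetical joint distribution over binary (X, Y, Z); entries sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 1, 0): 0.05, (1, 0, 0): 0.15, (1, 1, 0): 0.20,
    (0, 0, 1): 0.02, (0, 1, 1): 0.12, (1, 0, 1): 0.06, (1, 1, 1): 0.30,
}

for z in (0, 1):
    print(z, conditional_odds_ratio(joint, z),
          conditional_odds_ratio(swap_xy(joint), z))
```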

Definition 2 (G-Recoverability). Given a graph G, OR(X, Y | Z) is said
to be G-recoverable from s-biased data if the assumptions embedded in
G render it expressible in terms of the observable distribution
P(Vxy | S = 1) where Vxy = V \ {S}. Formally, for every two
probability distributions P1(.) and P2(.) compatible with G,
P1(vxy | S = 1) = P2(vxy | S = 1) implies OR1(X, Y | Z) =
OR2(X, Y | Z).

Definition 3 (Collapsibility). Consider two variables X and Y and
disjoint sets Z and W. We say that the odds ratio OR(X, Y | Z, W) is
collapsible over W if OR(X, Y | Z = z, W = w) =
OR(X, Y | Z = z, W = w') = OR(X, Y | Z = z), for all w ≠ w'.

Figure 2: (a) Chain graph where X represents treatment, Y is the
outcome, and S an indicator variable for the selection mechanism.
(b) Scenario where there exists a blocking set from {X, Y} to S yet
the OR is not G-recoverable. (c) Example where the c-specific OR is
G-recoverable.

Definition 3 and the following Lemma are stated in (Didelez et al.,
2010) and are based on a long tradition in Epidemiology starting with
(Cornfield, 1951) and followed by (Whittemore, 1978; Geng, 1992).2

Lemma 1. For any two sets, Z and W, the conditional odds ratio
OR(Y, X | Z, W) is collapsible over W (that is, OR(Y, X | Z, W) =
OR(Y, X | Z)), if either (X ⊥⊥ W | {Y, Z}) or (Y ⊥⊥ W | {X, Z}).
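
Lemma 1 can be checked numerically on a small model. The sketch below is illustrative only: it builds a hypothetical joint distribution in which X and W are independent given Y (with Z empty), and compares the W-specific odds ratios with the marginal one. The parameter values and the helper names are assumptions introduced for the example.

```python
from itertools import product

def odds_ratio(table):
    """OR(Y, X) per Definition 1 from weights table[(x, y)], binary X, Y."""
    return (table[(0, 1)] / table[(0, 0)]) / (table[(1, 1)] / table[(1, 0)])

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Hypothetical parameters; X and W depend on Y only, so X is
# independent of W given Y.
p_y1 = 0.4
p_x1_given_y = {0: 0.3, 1: 0.7}   # P(X=1 | Y=y)
p_w1_given_y = {0: 0.2, 1: 0.9}   # P(W=1 | Y=y)

joint = {
    (x, y, w): bern(p_y1, y) * bern(p_x1_given_y[y], x) * bern(p_w1_given_y[y], w)
    for x, y, w in product((0, 1), repeat=3)
}

pairs = list(product((0, 1), repeat=2))
marginal = {(x, y): joint[(x, y, 0)] + joint[(x, y, 1)] for x, y in pairs}
or_marginal = odds_ratio(marginal)
# Stratum tables are unnormalised P(x, y | w); P(w) cancels in the ratio.
or_by_w = {w: odds_ratio({(x, y): joint[(x, y, w)] for x, y in pairs})
           for w in (0, 1)}
print(or_marginal, or_by_w)
```

The factor P(w | y) appears in both the numerator and the denominator of each stratum-specific odds ratio, which is why the three values coincide.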

The following Corollary provides a graphical test for G-recoverability
(Def. 2) based on Lemma 1:

Corollary 1. Given a graph G in which node S represents selection, the
OR(X, Y | Z) is G-recoverable from s-biased data if Z is such that
(X ⊥⊥ S | {Y, Z})_G or (Y ⊥⊥ S | {X, Z})_G.
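
For the chain X -> Y -> S the condition of Corollary 1 holds with Z empty, since (X ⊥⊥ S | Y). The following sketch, using hypothetical probabilities, computes the odds ratio and the risk difference exactly in the population and among selected units; it is meant only to illustrate that the former is preserved under selection while the latter is not.

```python
def odds_ratio(table):
    """OR(Y, X) per Definition 1 from weights table[(x, y)], binary X, Y."""
    return (table[(0, 1)] / table[(0, 0)]) / (table[(1, 1)] / table[(1, 0)])

def risk_difference(table):
    """P(y | x) - P(y | x') from weights table[(x, y)]."""
    risk = {x: table[(x, 1)] / (table[(x, 0)] + table[(x, 1)]) for x in (0, 1)}
    return risk[1] - risk[0]

# Hypothetical chain X -> Y -> S: selection depends on the outcome only.
p_x1 = 0.4                        # P(X=1)
p_y1_given_x = {0: 0.2, 1: 0.5}   # P(Y=1 | X=x)
p_s1_given_y = {0: 0.1, 1: 0.8}   # P(S=1 | Y=y)

population = {}
for x in (0, 1):
    px = p_x1 if x == 1 else 1.0 - p_x1
    for y in (0, 1):
        py = p_y1_given_x[x] if y == 1 else 1.0 - p_y1_given_x[x]
        population[(x, y)] = px * py

# Unnormalised P(x, y | S=1); the constant P(S=1) cancels in both measures.
selected = {(x, y): p * p_s1_given_y[y] for (x, y), p in population.items()}

print(odds_ratio(population), odds_ratio(selected))
print(risk_difference(population), risk_difference(selected))
```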

There is an important subtlety here. One might surmise that selection
bias of OR(X, Y) can be removed if the condition of Corollary 1 holds,
i.e., there exists a separating set Z such that (X ⊥⊥ S | {Y, Z})_G or
(Y ⊥⊥ S | {X, Z})_G, but this is not the case. Consider Fig. 2(b)
where the set Z d-separates {X, Y} from S and therefore permits us to
remove S by writing OR(X, Y | Z, S = 1) as OR(X, Y | Z), yet the
unconditional OR is not G-recoverable because we cannot re-apply the
condition of Corollary 1 to eliminate Z from OR(X, Y | Z). Moreover,
the resulting quantity, OR(X, Y | Z), though estimable for every level
Z = z, does not represent a meaningful relation for decision making or
interpretation, because it does not stand for a causal effect in a
stable subset of individuals (see discussion about the causal OR at
the end of this section). Since Z is X-dependent in G, the class of
units for which Z = z under do(X = 1) is not the same as the class of
units for which Z = z under do(X = 0). The conditional odds ratio
OR(X, Y | Z) would be meaningful only if Z is restricted to
pre-treatment covariates, which are X-invariant, hence stable.

2 Cornfield's result and some of its graphical ramifications were
brought to our attention by Sander Greenland. See also (Greenland and
Pearl, 2011).

We next introduce a criterion, followed by a procedure to decide
whether it is legitimate to replace Z with a set C of pre-treatment
covariates, for which OR(Y, X | C) is a meaningful c-specific causal
effect. Typical examples of c-specific effects would be
C = {age, sex} or, when average behavior is desired, C = {}.

Definition 4 (OR-admissibility). A set Z = {Z1, ..., Zn} is
OR-admissible relative to an ordered triplet (X, Y, C) whenever an
ordering (Z1, ..., Zn) exists such that for each Zk, either
(X ⊥⊥ Zk | C, Y, Z1, ..., Z_{k-1}) or (Y ⊥⊥ Zk | C, X, Z1, ..., Z_{k-1}).

Corollary 2 (Didelez et al. (2010)). OR-admissibility of Z implies
OR(Y, X | C, Z) = OR(Y, X | C).

This Corollary follows by successive application of Lemma 1 to the
elements Z1, ..., Zn of Z.

Theorem 1 (OR G-recoverability). Let graph G contain the arrow X → Y
and a set C of measured X-independent covariates. The c-specific odds
ratio OR(Y, X | C) is G-recoverable from s-biased data if and only if
there exists an additional set Z of measured variables such that the
following conditions hold in G:

1. (X ⊥⊥ S | {Y, Z, C})_G or (Y ⊥⊥ S | {X, Z, C})_G.

2. Z is OR-admissible relative to (X, Y, C).

Moreover, OR(Y, X | C) = OR(Y, X | C, Z, S = 1).3

Proof.  See  Appendix.  □ 

Note that unlike the control of confounding, which requires averaging
over the adjusted covariates, a single instantiation of the variables
in Z is all that is needed for removing selection bias.

3 This Theorem builds on and extends the results in (Didelez et al.,
2010) which are summarized by Definition 4 and Corollary 2. First, it
supplements the sufficient condition with its necessary counterpart.
This is made possible by defining G-recoverability in terms of
identifiability (Def. 2). Second, Theorem 1 explicitly avoids
meaningless ORs (i.e., OR(X, Y | Z), where Z is X-dependent). Finally,
the proof of the sufficiency part prepares the ground for a procedure
for finding an admissible sequence if such exists, to be shown next.

Figure 3: Scenario where OR is G-recoverable and Z = {W1, W2, W4} (a),
and it is not G-recoverable in (b).

Let us consider the causal story of Section 1 concerning the effect of
Oestrogen (X) on Endometrial Cancer (Y) as depicted in Fig. 1(c). This
problem is solvable by setting Z = {W} and applying Theorem 1: we can
readily verify that Z is OR-admissible relative to (X, Y, {}) (i.e.,
(W ⊥⊥ Y | X)), and (X ⊥⊥ S | {Y, W}) holds. Thus, we can write
OR(Y, X) = OR(Y, X | W) = OR(X, Y | W) = OR(X, Y | W, S = 1), which
shows a mapping from the target (unbiased) quantity (without any S) to
the s-biased data (conditioned on S = 1, which was measured). (In the
sequel we will drop G, finding no need to distinguish conditional
independencies from d-separation statements.)4
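
The two conditions of Theorem 1 are d-separation statements and can be checked mechanically. The sketch below is one possible way to do so, not the procedure of this paper: it tests d-separation through the standard moralised ancestral graph construction and searches all orderings of Z by brute force for one that satisfies Definition 4. The edge list used for Fig. 1(c) (X -> Y, X -> W, W -> S, Y -> S) is our assumed reading of the figure, and all function names are hypothetical.

```python
from itertools import combinations, permutations

def ancestors(dag, nodes):
    """Nodes in `nodes` plus all their ancestors; dag maps child -> parents."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, ()))
    return seen

def d_separated(dag, a, b, given):
    """True if the set `given` d-separates node a from node b in the DAG."""
    given = set(given)
    keep = ancestors(dag, {a, b} | given)
    # Moralise the ancestral graph: link every node to its parents and
    # connect the parents of each common child.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = list(dag.get(child, ()))
        for p in parents:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents, 2):
            adj[p].add(q)
            adj[q].add(p)
    # a and b are d-separated iff every path passes through `given`.
    seen, stack = set(given), [a]
    while stack:
        node = stack.pop()
        if node == b:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(adj[node])
    return True

def or_admissible_ordering(dag, x, y, c, z):
    """Return an ordering of z satisfying Definition 4, or None."""
    for order in permutations(sorted(z)):
        for k, zk in enumerate(order):
            context = set(c) | set(order[:k])
            if not (d_separated(dag, x, zk, context | {y})
                    or d_separated(dag, y, zk, context | {x})):
                break
        else:
            return order
    return None

# Assumed reading of Fig. 1(c): X -> Y, X -> W, W -> S, Y -> S.
dag = {"Y": {"X"}, "W": {"X"}, "S": {"W", "Y"}}
z, c = {"W"}, set()

condition_1 = (d_separated(dag, "X", "S", z | c | {"Y"})
               or d_separated(dag, "Y", "S", z | c | {"X"}))
ordering = or_admissible_ordering(dag, "X", "Y", c, z)
print(condition_1, ordering)
```

The exhaustive search over orderings is exponential in the size of Z and is used here only because the example has a single candidate variable.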

Theorem 1 defines the boundary that distinguishes the class of graphs
that permit G-recoverability of OR from those that do not. To show the
power of Theorem 1, let us consider the more intricate scenario of
Fig. 3(a), in which Z = {W1, W2, W4} satisfies the conditions of
Theorem 1. This can be seen through the following sequence of
reductions verified by the graph: (X ⊥⊥ S | {Y, W1, W2, W4}) →
(Y ⊥⊥ W2 | {X, W1, W4}) → (X ⊥⊥ W1 | {Y, W4}) → (Y ⊥⊥ W4 | X). The
final result is

OR(Y, X) = OR(Y, X | W1, W2, W4, S = 1)

where the term on the left is our target quantity and the one on the
right is estimable from the s-biased data. Fig. 3(b) shows an example
where OR is not G-recoverable, because we must start with
Z = {W1, W2, W3, W4} or Z = {W4, W3, W4} to separate S from X or Y,
respectively; these two sets are not OR-admissible since each set
contains the variable W3, which cannot be separated from X or Y by any
set.

Theorem 1 relies on OR-admissibility, for which Definition 4 gives a
declarative, non-procedural criterion. Taken literally, it requires
that we first find a proper Z and then, out of the n! orderings of the
elements in Z, find one that will satisfy the d-separation tests
specified in Definition 4. We will now supplement Theorem 1 with a
simple graphical condition, followed by an effective procedure for
finding such a sequence if one exists.

4 Furthermore, the graph symmetric to Fig. 1(c) where the positions of
X and Y are interchanged yields the same result. Similarly, another
common variant of Fig. 1(c), with the edge X → W reversed, is solvable
as well.

Theorem 2. Let graph G contain the arrow X → Y. A necessary condition
for G to permit the G-recoverability of OR(Y, X | C) for a given set C
of pre-treatment covariates is that S and every ancestor Ai of S that
is also a descendant of X have a separating set Ti that either
d-separates Ai from X given Y, or d-separates Ai from Y given X.5

Proof.  See  Appendix.  □ 

Theorem 3. Let G be a DAG containing the arrow X → Y and two sets of
variables, measured V and unmeasured U. A necessary and sufficient
condition for G to permit the G-recoverability of OR(Y, X | C) for a
given set C of pre-treatment variables is that the sink-procedure
below terminates. Moreover, OR(Y, X | C) = OR(Y, X | C, Z, T, S = 1),
where Z = (An(S) \ An(Y)) ∩ V and T is given by the sink-procedure.

Procedure (Sink reduction)

1. Set T = {}, and consider Z as previously defined. Remove
   V \ An(Y ∪ S) from G, and name the new graph G*. Consider an
   ordering compatible with G* such that Zi < Zj whenever Zi is a
   non-descendant of Zj.

2. Test if sink Zi of G* satisfies the following condition:
   (Zi ⊥⊥ X | C, T, Y, Z1, ..., Z_{i-1}) or
   (Zi ⊥⊥ Y | C, T, X, Z1, ..., Z_{i-1}). If so, go to step 4.
   Otherwise, continue.

3. Test if there exists a minimal set Ti of non-descendants of X that,
   if added to T, would render step 2 successful; if none exists, exit
   with failure.5 Else, add Ti to T and continue with step 4.

4. Remove Zi from G* and from Z, and repeat step 2 recursively until Z
   is empty. If so, go to step 5.

5. Test if (T ⊥⊥ Y | C, X). If so, the sequence (Z1, Z2, ..., Zm) with
   T constitutes a witness for the OR-admissibility of Z relative to
   (X, Y, C), for a set C of X-independent variables. Otherwise, exit
   with failure.

Proof. See Appendix. □

The algorithm exploits the graph structure to construct a mapping from
the observed s-biased data to the desired target OR. Since the OR is
symmetric, it is not necessary to separate S from X and Y
simultaneously, but only from one of them (given the other). For
simplicity, denote the expression "X given Y or Y given X" by the
symbol Φxy. A separating set from S to Φxy is first sought in step 2,
starting with all observable ancestors of S that are non-ancestors of
Y. If the test succeeds and this set is a separator, the algorithm
iterates trying to separate Φxy from the deepest node in the remaining
set. In case of failure, the algorithm attempts (step 3) to achieve
separability using pre-treatment covariates Ti. In case no
separability can be found using these added covariates, the algorithm
fails. Otherwise, at the end, the algorithm further requires that all
Ti added along these iterations be separable from Y (step 5).

To illustrate, running the procedure on the graph of Fig. 3(b) with
C = {}, the graph remaining after the removal of S has two sink nodes,
W2 and W3. Removing W2 leaves two other sinks, W3 and W1. Removing W1
leaves W3 as the only remaining sink node, which fails the test of
Step 3. Since no non-descendant of X exists that yields separability,
we must exit with failure. On the other hand, if we are able to
measure U, the hidden variable responsible for the double arrow arc
between W3 and W4, we would add this node to T, W3 will pass the test,
followed by W4, and we will end up with U as the only non-descendant
of X remaining in T. In step 5 we remove U from T, yielding
OR(X, Y) = OR(X, Y | W, U, S = 1).

5 A polynomial time algorithm for finding a minimal separating set in
DAGs is given in (Tian et al., 1998). The restricted minimal
separation version of that algorithm finds a minimal separator in a
DAG with latent variables (equivalently, semi-Markovian models). A
fast test for the non-separability of X and Ai is the existence of an
inducing path between the two variables (Verma and Pearl, 1990). For
example, the path X → W4 → W3 in Fig. 3(b).


Thus far, we assumed that the treatment X is unconfounded; therefore
the OR is identical to the causal OR, COR(X, Y), obtained from
Definition 1 by replacing P(y | x) with P(y | do(x)). In the presence
of confounding, it is not enough to recover OR in s-biased data; we
need to go further and ensure that the recovered OR(X, Y | C) is such
that C satisfies the back-door criterion (2nd rule of do-calculus,
observing and intervening are equivalent), in which case OR(X, Y | C)
will represent the c-specific causal OR. For example, in Fig. 2(c) the
COR(X, Y | C) will be G-recoverable because once we condition on C all
conditional independencies will be identical to those of Fig. 1(c),
and P(Y | do(X), C) = P(Y | X, C).


Note, however, that although we can recover the c-specific causal OR,
we cannot recover the population COR(X, Y). For such a measure to be
recoverable we need to add assumptions which will make it possible to
infer averageable measures of causal effects such as RD and RR, to be
handled next.


3  OR  and  other  measures  of  causal 
effects 

Consider again the chain structure in Fig. 2(a) and define the causal
effect as COR(X, Y). The fact that X and Y are not confounded permits
us to estimate the causal effect COR(X, Y) by the odds ratio OR(X, Y)
which, by the results in the previous section, will remain invariant
to conditioning on S = 1. However, if we define the causal effect as
ACE = Pr(y | do(x)) − Pr(y | do(x')) (also known as the causal risk
difference), a bias will be introduced upon conditioning.

Figure 4: (a) Constant odds ratio curves for c = {1.00, 1.01, 1.50,
2.00, 5.00, 10.00} and their inverses; superimposed constant odds
ratio with constant risk ratio curves (b) and constant risk difference
curves (c).

The invariance of OR can be represented in the following intuitive and
pictorial way. We characterize the conditional distribution P(Y | X)
by two independent parameters p = P(y | x) and q = P(y | x'), which
define a point (p, q) in the unit square. The condition OR(X, Y) = c
describes a curve in the (p, q)-plane. For c = 1, the curve is the
unit slope line. For c > 1, this curve separates points with
OR(.) > c from those with OR(.) < c in the region below the unit slope
line (symmetrically for the inverses (c < 1) in the region above
q = p). See Fig. 4.

Now, by conditioning on S = 1, we obtain a new conditional
probability, also characterized by two independent parameters
ps = P(y | x, S = 1), qs = P(y | x', S = 1). The fact that
OR(Y, X | S = 1) = OR(Y, X) means that conditioning on S = 1 must
shift the initial (p, q) point along a constant OR curve, not anywhere
else. We show these universal curves of constant OR for
c = {1.00, 1.01, 1.50, 2.00, 5.00, 10.00} and their respective
inverses in Fig. 4(a). Fig. 4(b)
shows curves for constant risk ratio (RR: p/q = c),
which  are  variable  slope  lines  going  through  the  ori¬ 
gin,  and  bounded  by  the  slope  }.  Similarly,  Fig.  4(c) 
shows  curves  for  constant  risk  difference. 

We see that even though RR does not remain constant (upon
conditioning), the constancy of OR constrains the behavior of the RR.
This follows by noting (after some algebra) that RR = c + (1 − c)p,
i.e., RR has intercept c and slope 1 − c. For instance, if OR is
constant and c = 1, we have unit slope line for OR, but RR does not
move and is equal to one. For constant OR and 1/2 < c < 1, the slope
is positive but less than 1/2, and the intercept is greater than 1/2,
which implies that RR lies inside the interval [c, 1]. Similar bounds
can be obtained for other values of c.
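
The relation between OR and RR along a constant-OR curve can be verified numerically. The sketch below assumes the orientation OR = [p/(1 − p)] / [q/(1 − q)] and RR = p/q, which is the one under which the identity RR = c + (1 − c)p holds; the function names are introduced for illustration only.

```python
def q_on_constant_or_curve(p, c):
    """q = P(y | x') such that the odds ratio equals c at p = P(y | x).

    Assumes OR = [p / (1 - p)] / [q / (1 - q)]; solving for q gives
    q = p / (p + c (1 - p)).
    """
    return p / (p + c * (1.0 - p))

def risk_ratio(p, q):
    return p / q

def risk_difference(p, q):
    return p - q

c = 2.0
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    q = q_on_constant_or_curve(p, c)
    print(f"p={p:.1f} q={q:.4f} RR={risk_ratio(p, q):.4f} "
          f"c+(1-c)p={c + (1 - c) * p:.4f} RD={risk_difference(p, q):.4f}")
```

Moving along the curve changes both RR and RD while the odds ratio stays fixed at c, which is the behavior described above for conditioning on S = 1.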


Recovering  RR  and  RD  under  selection  bias 

In this section we show that, in some situations, point estimates of
RR and RD can be recovered from s-biased data in studies where the
prior probability P(X) is available.6 In other words, we refer back to
the chain structure of Fig. 2(a) and ask whether P(Y | X) can be
recovered from P(X) and P(X, Y | S = 1).

The solution can be obtained algebraically, noting that Y d-separates
X from S, which permits us to write:

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y \mid X)\,P(X) + P(Y \mid \neg X)\,P(\neg X)}$$

$$P(X \mid \neg Y) = \frac{P(\neg Y \mid X)\,P(X)}{P(\neg Y \mid X)\,P(X) + P(\neg Y \mid \neg X)\,P(\neg X)}$$

This can be turned into two linear equations with two unknowns,
{P(Y | X), P(Y | ¬X)}, which gives:

$$P(Y \mid X) = \frac{P(X \mid Y)\,\bigl(P(X \mid \neg Y) - P(X)\bigr)}{\bigl(P(X \mid \neg Y) - P(X \mid Y)\bigr)\,P(X)}$$

$$P(Y \mid \neg X) = \frac{P(\neg X \mid Y)\,\bigl(P(X \mid \neg Y) - P(X)\bigr)}{\bigl(P(X \mid \neg Y) - P(X \mid Y)\bigr)\,P(\neg X)} \qquad (1)$$

where P(X | Y) = P(X | Y, S = 1), ∀X, Y.7

6 Potentially, we are under an RCT setup or have an alternative way to
access it through external studies such as census data.
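
Equation (1) translates directly into a few lines of code. The sketch below is a minimal illustration for binary X and Y in the chain X -> Y -> S: the function name and the numerical values are hypothetical, and the round trip at the end simply builds s-biased data from a known model and checks that the original risks are returned.

```python
def recover_p_y_given_x(p_x, p_x_given_y_sel, p_x_given_not_y_sel):
    """Recover (P(y | x), P(y | not-x)) from P(x) and s-biased conditionals.

    p_x                 : P(x) in the target population (assumed known)
    p_x_given_y_sel     : P(x | y, S=1), estimable from s-biased data
    p_x_given_not_y_sel : P(x | not-y, S=1), estimable from s-biased data
    Relies on P(X | Y, S=1) = P(X | Y), which holds because Y d-separates
    X from S in the chain.
    """
    a, b = p_x_given_y_sel, p_x_given_not_y_sel
    if a == b:
        raise ValueError("P(x | y) = P(x | not-y): P(y) is not determined")
    p_y = (b - p_x) / (b - a)      # solves P(x) = a P(y) + b (1 - P(y))
    p_y_given_x = a * p_y / p_x
    p_y_given_not_x = (1.0 - a) * p_y / (1.0 - p_x)
    return p_y_given_x, p_y_given_not_x

# Hypothetical chain model used only to exercise the formula.
p_x = 0.4
p_y_given = {1: 0.5, 0: 0.2}      # P(y | x), P(y | not-x)
p_s_given_y = {1: 0.8, 0: 0.1}    # P(S=1 | y), P(S=1 | not-y)

# Unnormalised s-biased joint P(x, y | S=1).
sel = {}
for x in (0, 1):
    px = p_x if x == 1 else 1.0 - p_x
    for y in (0, 1):
        py = p_y_given[x] if y == 1 else 1.0 - p_y_given[x]
        sel[(x, y)] = px * py * p_s_given_y[y]

a = sel[(1, 1)] / (sel[(0, 1)] + sel[(1, 1)])   # P(x | y, S=1)
b = sel[(1, 0)] / (sel[(0, 0)] + sel[(1, 0)])   # P(x | not-y, S=1)

p1, p0 = recover_p_y_given_x(p_x, a, b)
print(p1, p0, p1 / p0, p1 - p0)   # recovered risks, then RR and RD
```

Once both risks are recovered, RR and RD follow as their ratio and difference.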

This simple result exemplifies a general theme of correcting for
selection bias (Section 4): the bias induced by preferential selection
can be removed if we have enough unconfounded variables that constrain
the distribution of the remaining variables in a specific way.

Note that this case differs from the one discussed previously, in
which we were interested only in the OR. Next we extend this result to
more elaborate scenarios.

4  Randomization with non-compliance under selection bias

Let  us  consider  the  more  general  problem  depicted  in 
Fig.  5(a)  in  which  confounding  and  selection  biases 
are  simultaneously  present,  and  there  are  instrumental 
variables  available. 

Our goal is to infer the most accurate bounds for the causal effect of
X on Y, knowing that there is no unbiased estimate for this quantity
even when selection bias is not present. This scenario is usually
presented under the rubric of “randomization with non-compliance”, and
it is pervasive in the Economics literature. We defer to (Pearl, 2009,
Ch. 8) for a more comprehensive discussion of the relevance of this
setup and focus here on the technical aspects of the problem.

Generally,  the  bounding  analysis  assumes  no  selection 
bias,  and  the  natural  question  that  arises  is  whether 
selection  bias  can  be  treated  and  under  which  condi¬ 
tions  bounds  free  from  selection  can  be  recovered. 

We show next that this problem can be solved assuming the existence of
two instrumental variables Z1 and Z2.⁸ Notably, the assumptions used in
our analysis are commonplace in everyday Econometrics practice, and
their seemingly convoluted appearance fades when one views them through
the causal graph depicted in Fig. 5(a). In a nutshell, they are the same

7 In Epidemiology, there are many “longitudinal data settings” where
selection bias is sequential, in which it may be easier to estimate the
probability of selection instead of P(X). This observation was brought
to our attention by Onyebuchi A. Arah.

8 Call Z = Z1 ∪ Z2, or consider one IV with the same number of levels.
We refer to both cases as the instrumental variable set.


Figure 5: Different scenarios in which Theorem 4 can be applied.
(a) Typical study with randomization and non-compliance (IV as
incentive mechanism) where selection and confounding are both present.
(b) Selection bias in the back-door case. (c) More complex study with an
intermediary variable W between treatment and selection. In this case,
Y directly causes W and there is a common cause between them (extension
of Fig. 1(c); see Corollary 5).

assumptions  of  randomization  with  non-compliance 
together  with  selection  bias  (such  that  treatment  and 
outcome  affect  entry  in  the  data  pool) . 

Theorem 4. The joint distribution P(X, Y, Z) is recoverable from
s-biased data whenever the following conditions hold: (i) the S node is
affected by the set Z only through {X, Y}; (ii) the set Z is d-connected
to {X, Y} (and combinations); (iii) the dimensionality of Z matches the
dimensionality of {X, Y}; (iv) the marginal probability of Z is known.
In other words, the distribution P(X, Y, Z) is recoverable from s-biased
data whenever $(S \perp\!\!\!\perp Z \mid X, Y)$,
$(Z \not\perp\!\!\!\perp \{X, Y\})$, $(Z \not\perp\!\!\!\perp X \mid Y)$,
$(Z \not\perp\!\!\!\perp Y \mid X)$, the dimensionalities of Z and X ∪ Y
match, and the marginal distribution P(Z) is given.

Proof.  See  Appendix.  □ 

Corollary 3. The bounds for P(y | do(x)) in the scenario of
randomization with non-compliance (Fig. 5(a)) are recoverable from
s-biased data whenever the conditions of Theorem 4 hold.

Proof.  It  follows  directly  from  Theorem  4  together 
with  the  bounds  in  (Balke  and  Pearl,  1997).  □ 

Corollary 4. The causal effect P(y | do(x)) in the back-door scenario
(Fig. 5(b)) is recoverable from s-biased data whenever the conditions of
Theorem 4 hold.

Proof.  It  follows  directly  from  Theorem  4.  □ 

Corollary 5. The causal effect of Oestrogen (X) on Endometrial Cancer
(Y) as studied in (Horwitz and Feinstein, 1978; Hernan et al., 2004)
(Fig. 5(c)) is recoverable from s-biased data whenever there is an IV
set Z pointing to X, and the conditions of Theorem 4 hold. Moreover, the
same holds without relying on Z whenever the following conditions hold:
(i) X has the same dimensionality as {W, Y}; (ii) the marginal
distribution P(X) is available.

Proof. See Appendix. □

Some  observations  on  the  method 

Methods  that  handle  selection  bias  under  different 
causal  assumptions  try  to  model  the  distribution  of  S, 
which  is  unobservable  and  usually  hard  to  estimate;  we 
take  a  different  approach  and  avoid  doing  this  explicit 
manipulation  of  the  selection  mechanism  by  exploiting 
the  topology  of  the  causal  graph  and  the  underlying 
data-generating  process.  We  are  not  aware  of  other  ap¬ 
proaches  trying  to  do  so. 

The main idea is to exploit the conditional independence of the IV set
Z and the selection mechanism S given the treatment and outcome, whose
distribution is, interestingly, what we seek to estimate. The method
hinges on two properties of the induced system, namely that it is
linearizable and full rank; neither fact was obvious or expected a
priori.
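The role of linearity and full rank can be illustrated with a simplified sketch. It does not reproduce the system used in the proof of Theorem 4; instead it assumes the law of total probability, P(z) = Σ_{x,y} P(z | x, y) P(x, y), in which P(z | x, y) = P(z | x, y, S = 1) is read off the s-biased data and P(z) is the known marginal. All names and numbers below are hypothetical.

```python
from itertools import product


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("system is not full rank")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


xy_states = list(product((0, 1), repeat=2))          # (x, y)
z_states = list(product((0, 1), repeat=2))           # (z1, z2)

# Hypothetical ground truth, unknown to the analyst.
p_xy = dict(zip(xy_states, (0.1, 0.2, 0.3, 0.4)))
columns = ((0.4, 0.3, 0.2, 0.1), (0.1, 0.4, 0.3, 0.2),
           (0.2, 0.1, 0.4, 0.3), (0.3, 0.2, 0.1, 0.4))
p_z_given_xy = {xy: dict(zip(z_states, col)) for xy, col in zip(xy_states, columns)}
p_s_given_xy = dict(zip(xy_states, (0.9, 0.5, 0.2, 0.7)))   # S depends on X, Y only

# What the analyst observes: P(x, y, z | S=1) and the marginal P(z).
unnorm = {(xy, z): p_xy[xy] * p_z_given_xy[xy][z] * p_s_given_xy[xy]
          for xy in xy_states for z in z_states}
total = sum(unnorm.values())
biased = {key: value / total for key, value in unnorm.items()}
p_z = {z: sum(p_xy[xy] * p_z_given_xy[xy][z] for xy in xy_states) for z in z_states}

# gamma[i][j] = P(z_i | xy_j, S=1), computed from the biased data alone.
biased_xy = {xy: sum(biased[(xy, z)] for z in z_states) for xy in xy_states}
gamma = [[biased[(xy, z)] / biased_xy[xy] for xy in xy_states] for z in z_states]

# Full rank of gamma makes P(x, y), and hence P(x, y, z), unique.
recovered = solve_linear(gamma, [p_z[z] for z in z_states])
recovered_joint = {(xy, z): gamma[i][j] * recovered[j]
                   for i, z in enumerate(z_states)
                   for j, xy in enumerate(xy_states)}
```

In this example `recovered` agrees with the unbiased `p_xy` although the selection probabilities never enter the computation, which is the point of avoiding an explicit model of S.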

It is worth making some additional remarks that follow from the proof
of Theorem 4. First, note that the proposed method relies on a sample
size approaching infinity, which is difficult to obtain in practice. As
a possible improvement, the problem could be cast as an optimization
problem. The formulation goes as follows. We associate an error term
$\epsilon_{z_1 z_2, xy}$ with each $\gamma_{z_1 z_2, xy}$ term, and
proceed with the analysis by minimizing the mean squared error subject
to constraints. The constraints emerge naturally from the induced system
of equations together with the additional constraints of positivity and
integrality. Our original goal was to show the feasibility of removing
selection bias (identifiability), not estimation per se; still, this
should be an interesting exercise to pursue. Further investigation is
needed to check the applicability of this suggestion.
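One possible reading of this suggestion is sketched below, under loudly stated assumptions: it uses a simplified total-probability system Γp = β (Γ holding estimated values of P(z | x, y, S = 1), β the known P(z)) rather than the exact system from the proof, it interprets positivity and integrality as p ≥ 0 and Σp = 1, and it uses projected gradient descent, which is our choice of solver and not something the text prescribes. All names and numbers are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:           # holds for a prefix of the sorted values
            theta = t
    return [max(x - theta, 0.0) for x in v]


def fit_on_simplex(gamma, beta, steps=3000):
    """Minimize 0.5 * ||gamma @ p - beta||^2 over the probability simplex."""
    n = len(gamma[0])
    # 1 / ||gamma||_F^2 is a conservative step size for this objective.
    step = 1.0 / sum(g * g for row in gamma for g in row)
    p = [1.0 / n] * n
    for _ in range(steps):
        residual = [sum(g * pj for g, pj in zip(row, p)) - b
                    for row, b in zip(gamma, beta)]
        grad = [sum(row[j] * r for row, r in zip(gamma, residual))
                for j in range(n)]
        p = project_to_simplex([pj - step * gj for pj, gj in zip(p, grad)])
    return p


# Hypothetical gamma[i][j] = P(z_i | xy_j, S=1); each column sums to one.
gamma = [[0.4, 0.1, 0.2, 0.3],
         [0.3, 0.4, 0.1, 0.2],
         [0.2, 0.3, 0.4, 0.1],
         [0.1, 0.2, 0.3, 0.4]]
p_true = [0.1, 0.2, 0.3, 0.4]
beta_exact = [sum(g * p for g, p in zip(row, p_true)) for row in gamma]
# Perturbed right-hand side standing in for finite-sample error.
beta_noisy = [b + e for b, e in zip(beta_exact, (0.02, -0.01, -0.02, 0.01))]

exact_fit = fit_on_simplex(gamma, beta_exact)
noisy_fit = fit_on_simplex(gamma, beta_noisy)
```

With exact inputs the fit returns the true distribution; with perturbed inputs it still returns a valid probability vector, whereas a plain linear solve offers no such guarantee.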

We  envision  our  method  being  used  as  a  first  step 
in  a  pre-processing  stage,  before  the  application  of 
any  bounding  (Balke  and  Pearl,  1997)  or  estimation 
procedure.  The  method  returns  the  same  values  of 
P(X ,  Y,  Z)  whenever  the  collected  data  is  not  under 
selection  bias,  which  means  that  its  usage  will  not  hurt 
and  should  be  considered  as  a  “good  practice.” 

Finally, it is also important to mention that there are scenarios not
solvable by our method or in which our assumptions are not applicable.
For instance, Fig. 6 shows one such scenario, in which selection and
confounding biases are entangled in such a way that it does not seem
possible to detach one from the other. We conjecture that this case is
not solvable in general without further assumptions. Notice that even if
we remove the edge U → X, the example is still hard to resolve.


Figure  6:  Scenario  in  which  selection  and  confounding 
biases  are  present,  entangled,  and  thus  not  recoverable. 

5  Conclusion 

We  showed  that  qualitative  knowledge  of  the  selection 
mechanism  and  the  use  of  instrumental  variables  can 
eliminate  selection  bias  in  many  realistic  problems.  In 
particular,  the  paper  provides  a  general  graphical  con¬ 
dition  together  with  an  algorithm  that  operates  on  a 
general  DAG,  with  measured  and  unmeasured  nodes, 
and  decides  whether  and  how  a  given  c-specific  odds 
ratio  can  be  recovered  from  selection-biased  data  char¬ 
acterized  by  a  selection  node  S.  We  further  showed  by 
algebraic  methods  that  selection  bias  can  be  removed 
with  the  help  of  instrumental  variables  under  a  mild 
set  of  conditions. 

This  paper  complements  recent  work  on  transporta¬ 
bility  (Pearl  and  Bareinboim,  2011)  which  deals  with 
transferring  causal  information  from  one  environment 
to  another,  in  which  only  passive  observations  can  be 
collected.  The  solution  to  the  transportability  problem 
assumes  that  disparities  between  the  two  environments 
are  represented  graphically  in  the  form  of  unobserved 
factors  capable  of  causing  such  disparities.  The  prob¬ 
lem  of  selection  bias  also  seeks  extrapolation  between 
two  environments;  from  one  in  which  samples  are  se¬ 
lected  preferentially,  to  one  in  which  no  preferential 
sampling takes place. Both problems represent environmental differences
in the form of auxiliary (selection) variables, the influence of which
we seek to eliminate. However, the semantics of those variables is
different. In selection bias the auxiliary S-variables represent
disparities in the data-gathering process, whereas in the
transportability problem they represent disparities in the structure of
the data-generation process itself.

Appendix: Proofs


Theorem  1 


(if part) Our target quantity is OR(X, Y | C) and, given that Z is
OR-admissible relative to (X, Y, C), Corollary 2 permits us to add Z and
rewrite it as OR(X, Y | C, Z). Given that the first condition of the
theorem holds, Corollary 1 implies OR(X, Y | C, Z) =
OR(X, Y | C, Z, S = 1). This establishes G-recoverability since the
r.h.s. is estimable from the available s-biased data.


(only if part) If the conditions of the theorem cannot be satisfied,
then OR(X, Y | C) is not G-recoverable, that is, there exist two
distributions P1, P2 compatible with G such that they agree in the
probability under selection, P1(V \ {S} | S = 1) = P2(V \ {S} | S = 1),
and disagree in the odds ratio, OR1(X, Y | C) ≠ OR2(X, Y | C). We first
consider the case when C = {}, and we will construct two such
distributions. Let P1 be compatible with the graph G1 = G, and P2 with
the subgraph G2 where all edges pointing to S are removed. Both are
compatible with G, since compatibility with a subgraph assures
compatibility with the graph itself (Pearl, 1988). Notice that P2
harbors an additional independence
$(V \setminus \{S\} \perp\!\!\!\perp S)_{P_2}$. By construction
P1(X, Y | S = 1) = P2(X, Y | S = 1), but since

$$P_2(X, Y \mid S = 1) = P_2(X, Y),$$

we have:

$$P_1(X, Y \mid S = 1) = P_2(X, Y).$$

We can then simplify OR2, rewriting it as follows,

$$OR_2 = \frac{P_1(x, y, S=1)\,P_1(\bar{x}, \bar{y}, S=1)}{P_1(x, \bar{y}, S=1)\,P_1(\bar{x}, y, S=1)}, \tag{2}$$

and similarly for OR1,

$$OR_1 = \frac{P_1(x, y)\,P_1(\bar{x}, \bar{y})}{P_1(x, \bar{y})\,P_1(\bar{x}, y)}. \tag{3}$$


We want to show that it is possible to produce a parametrization of P1
in such a way that OR1(X, Y) ≠ OR2(X, Y). First, let us consider the
class of Markovian models. Accordingly, P1 can be parametrized through
its factors in the Markov decomposition P1(S = 1 | PA_S),
P1(X | PA_X), ..., or more generally, P1(V_i | PA_i) for each family in
the graph. This choice of parameters induces a valid parameterization
for P2 as well. Let us begin with the case in which condition 1 of the
theorem fails, i.e., {X, Y} are not separable from S. Thus, eq. (2) can
be rewritten using the identity P1(X, Y, S = 1) =
P1(S = 1 | X, Y) P1(X, Y), yielding:

$$OR_2 = OR_1 \left( \frac{P_1(S=1 \mid x, y)\,P_1(S=1 \mid \bar{x}, \bar{y})}{P_1(S=1 \mid x, \bar{y})\,P_1(S=1 \mid \bar{x}, y)} \right) \tag{4}$$


Note that making the multiplier of OR1 in eq. (4) different from 1
entails OR2 ≠ OR1, which will happen for almost all parametrizations of
P1(S = 1 | ·), independently of the one chosen for P1(X, Y). In case
there are additional nodes pointing to S, we can just make them
independent of S in this new parametrization, given that compatibility
with the subgraph is enough to ensure compatibility with G.
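The factorization in eq. (4) is easy to check numerically. The sketch below uses hypothetical 2×2 tables for binary X and Y; the helper name `odds_ratio` and all numbers are illustrative assumptions.

```python
def odds_ratio(table):
    """Odds ratio of a 2x2 table indexed by (x, y); its scale is irrelevant."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])


# Hypothetical P1(x, y) and selection probabilities P1(S=1 | x, y).
p_xy = {(1, 1): 0.18, (1, 0): 0.12, (0, 1): 0.14, (0, 0): 0.56}
p_s = {(1, 1): 0.9, (1, 0): 0.3, (0, 1): 0.6, (0, 0): 0.4}

selected = {k: p_xy[k] * p_s[k] for k in p_xy}    # P1(x, y, S=1)
or_1 = odds_ratio(p_xy)                            # unbiased OR, eq. (3)
or_2 = odds_ratio(selected)                        # OR under selection, eq. (2)
multiplier = odds_ratio(p_s)                       # bracketed factor in eq. (4)

# If selection depends on Y alone, the multiplier equals 1 and the OR survives.
p_s_y_only = {(1, 1): 0.9, (0, 1): 0.9, (1, 0): 0.3, (0, 0): 0.3}
or_2_y_only = odds_ratio({k: p_xy[k] * p_s_y_only[k] for k in p_xy})
```

Here the selected-sample odds ratio equals the unbiased one times the multiplier, and the two coincide only in the second scenario, where S is separable from X given Y.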

Now, let us consider the case in which condition 2 of the theorem
fails, i.e., there is no OR-admissible sequence in relation to
(X, Y, {}). Let Z = V \ {X, Y, S}, and expand P1(X, Y, S = 1) in the
following way⁹:

$$\begin{aligned}
P_1(X, Y, S=1) &= \sum_{Z} P_1(X, Y, S=1, Z) \\
&= \sum_{Z} P_1(X \mid PA_X) \cdots P_1(S=1 \mid PA_S) \\
&= \sum_{Z} \prod_{V,\,S=1} P_1(V_i \mid PA_i)
\end{aligned} \tag{5}$$

Notice that each term in eq. (2) can be rearranged for each assignment
of S's parents (i.e., $PA_S = pa_S^{(i)}$); for instance, we can write
based on eq. (5):

$$\begin{aligned}
P_1(X, Y, S=1) = {} & P_1(S=1 \mid PA_S = pa_S^{(1)}, \lambda) \Bigl( \sum_{Z,\,PA_S = pa_S^{(1)}} \prod_{V \setminus S} P_1(V_i \mid PA_i) \Bigr) \\
& + P_1(S=1 \mid PA_S = pa_S^{(2)}, \lambda) \Bigl( \sum_{Z,\,PA_S = pa_S^{(2)}} \prod_{V \setminus S} P_1(V_i \mid PA_i) \Bigr) \\
& + \cdots + P_1(S=1 \mid PA_S = pa_S^{(k)}, \lambda) \Bigl( \sum_{Z,\,PA_S = pa_S^{(k)}} \prod_{V \setminus S} P_1(V_i \mid PA_i) \Bigr)
\end{aligned} \tag{6}$$

where k is the number of configurations of S's parents, and λ indexes
configurations of X or Y whenever one of them is a parent of S. Given
eq. (6), let us call $P_1(S=1 \mid PA_S = pa_S^{(1)}, \lambda) = \alpha_1$,
$P_1(S=1 \mid PA_S = pa_S^{(2)}, \lambda) = \alpha_2$, ..., and also call
$\sum_{Z,\,PA_S = pa_S^{(i)}} \prod_{V \setminus S} P_1(V_i \mid PA_i) = f_i(x, y)$
for each configuration X = x, Y = y, $PA_S = pa_S^{(i)}$. Then, we can
write eq. (6) in the following simplified manner:

$$P_1(X, Y, S=1) = \alpha_1 f_1(x, y) + \alpha_2 f_2(x, y) + \cdots \tag{7}$$


for all values of X and Y. We can then rewrite OR2 based on eq. (7) as

$$OR_2 = \frac{(\alpha_1 f_1(x, y) + \alpha_2 f_2(x, y) + \cdots)\,(\alpha_1 f_1(\bar{x}, \bar{y}) + \alpha_2 f_2(\bar{x}, \bar{y}) + \cdots)}{(\alpha_1 f_1(x, \bar{y}) + \alpha_2 f_2(x, \bar{y}) + \cdots)\,(\alpha_1 f_1(\bar{x}, y) + \alpha_2 f_2(\bar{x}, y) + \cdots)} \tag{8}$$

where each $\alpha_i$ is taken at the configuration of X and Y of the
factor in which it appears, and similarly for OR1:

$$OR_1 = \frac{(f_1(x, y) + f_2(x, y) + \cdots)\,(f_1(\bar{x}, \bar{y}) + f_2(\bar{x}, \bar{y}) + \cdots)}{(f_1(x, \bar{y}) + f_2(x, \bar{y}) + \cdots)\,(f_1(\bar{x}, y) + f_2(\bar{x}, y) + \cdots)} \tag{9}$$

9 It is clear that, in the expression above, we should consider (with
respect to Z) just the nodes that are related to S, i.e., its
ancestors; otherwise we could simply sum these vertices out, because
they impose no additional constraint on the distribution of interest
related to the OR, nor on its parameterization.

There is an important observation here. Given that there is no
admissible sequence relative to (X, Y, {}), there exists a set W such
that W is needed to separate S from X or Y, but also
$(W \not\perp\!\!\!\perp \{X, Y\} \mid Z')$, for Z' non-descendants of W
and in Anc(S); otherwise there would exist an admissible sequence. If W
is different from {S}, it is the case that, by construction, W is
contained in the factor $f_i(x, y)$. Thus, we have an asymmetry, given
that W, and so $f_i(\cdot)$, change depending simultaneously on the
specific instantiation of X and Y, and consequently eq. (8) cannot be
simplified in the general case. That is, the linear combinations in
eq. (8) do not degenerate by factoring out, independently of the given
parametrization, because each of them contains a different element.

Now let us consider the following parametrization for P1: set
$P_1(V_i \mid PA_i) = 1/2$ for all families except for the family of the
S node (i.e., $P(S = 1 \mid PA_S)$) and the exclusive families included
in the factor $f_i(x, y)$ (i.e., for when X = x, Y = y). Thus, rewrite
OR2 based on eq. (8):

$$OR_2 = \frac{\alpha_1 f_1(x, y) + \alpha_2 f_2(x, y) + \cdots}{(1/2)^l\,(\alpha_1 + \alpha_2 + \cdots)} \tag{10}$$

where l is equal to k minus the number of summands in the respective
expression (eq. (6)). Let us also rewrite eq. (9) according to this
parametrization, which yields:

$$OR_1 = \frac{f_1(x, y) + f_2(x, y) + \cdots}{k\,(1/2)^l} \tag{11}$$

After applying some simplifications to eqs. (10) and (11), we obtain,
respectively,

$$OR_2 = \frac{\alpha_1 f_1(x, y) + \alpha_2 f_2(x, y) + \cdots}{\alpha_1 + \alpha_2 + \cdots} \tag{12}$$

and

$$OR_1 = \frac{f_1(x, y) + f_2(x, y) + \cdots}{k} \tag{13}$$

Notice that OR2 in eq. (12) is the weighted arithmetic mean of the
$f_i(\cdot)$'s, weighted by the $\alpha_i$'s, and OR1 in eq. (13) is the
arithmetic mean of the $f_i(\cdot)$'s. After simplifications, the
remaining parameters lie in the space $[0, 1]^{m+k}$, where m is the
number of free parameters in the $f_i(\cdot)$'s. Note that
OR1 − OR2 = 0 adds a constraint in this space, and in order to satisfy
it we should choose a point on a surface in $[0, 1]^{m+k-1}$ inside
$[0, 1]^{m+k}$, i.e., a set that has Lebesgue measure zero.
Consequently, if we randomly choose parameters the equality will almost
never hold (and the inequality OR1 ≠ OR2 will hold almost always), so we
can just randomly draw the parameters from $[0, 1]^{m+k}$ until this is
the case, which finishes this part of the proof. The case of the
conditional OR is similar: we basically have to write eqs. (2) and (3)
appropriately, considering C, and exactly the same reasoning applies.

For the case when the graph contains unobservable variables, the proof
is essentially the same, except that an appropriate parametrization of
the underlying generating model should be used; for this, consider the
factorization given in (Evans and Richardson, 2011).

Theorem 2

For the necessity of the condition, we need to show that the failure of
any ancestor A_i of S that is also a descendant of X (including S
itself) to be separated (from either X or Y) prevents recoverability of
OR(Y, X | C). Indeed, A_i cannot be part of an admissible sequence, nor
can any of its children be part of an admissible sequence, because in
order to separate any such child from either X or Y we would need to
condition on the parent A_i, and then the sequence would become
non-admissible. Proceeding by induction, we eventually reach S itself,
whose failure to enter an admissible sequence renders the existence of
such a sequence impossible. By Theorem 1, the nonexistence of an
admissible sequence implies that OR(X, Y | C) is not G-recoverable. □

Theorem 3


Throughout the proof we use some graphoid axioms and other DAG
properties as shown in (Pearl, 1988). Let us first consider the
correctness of the algorithm. The main idea of the reduction sequence is
to use each conditional independence (CI) in step 2 of the
sink-procedure to substantiate an OR reduction, creating a mapping
starting from the s-biased data OR(X, Y | C, Z_1, ..., Z_k, S = 1) and
reaching the target (unbiased) expression OR(X, Y | C). If nodes are not
added in step 3 of the algorithm, it is obvious that the sequence
induces a valid step-OR reduction, which witnesses the OR
G-recoverability. So, let us consider the case when nodes have to be
added to T along the execution of the algorithm. At each step i, we
reduce OR(X, Y | C, T, Z_1, ..., Z_i) to
OR(X, Y | C, T, Z_1, ..., Z_{i-1}), as allowed by the CI in step 2. But
given that T_i can be added to T along the execution of the algorithm,
we need to show that this operation is allowed, i.e., it does not
invalidate the construction of the desired mapping between the unbiased
OR and the s-biased one. Towards a contradiction, consider an arbitrary
node Z_j such that

$$(Z_j \perp\!\!\!\perp X \mid C, T, Y, Z_1, \ldots, Z_{j-1}) \;\text{ or }\; (Z_j \perp\!\!\!\perp Y \mid C, T, X, Z_1, \ldots, Z_{j-1}) \tag{14}$$

Now, consider the first Z_k such that k < j and, in order to satisfy
step 2 in the sink-procedure, W has to be added to the conditioning set;
then

$$(Z_k \perp\!\!\!\perp X \mid C, T, Y, Z_1, \ldots, Z_{k-1}, W) \;\text{ or }\; (Z_k \perp\!\!\!\perp Y \mid C, T, X, Z_1, \ldots, Z_{k-1}, W) \tag{15}$$

but also

$$(Z_j \perp\!\!\!\perp X \mid C, T, Y, Z_1, \ldots, Z_{j-1}, W) \;\text{ or }\; (Z_j \perp\!\!\!\perp Y \mid C, T, X, Z_1, \ldots, Z_{j-1}, W) \tag{16}$$

is false. If the sink-procedure ends, it is also true that

$$(T \perp\!\!\!\perp Y \mid C, X) \tag{17}$$

From eq. (14), all paths from Z_j to X or Y (including the ones passing
through W) are closed after conditioning on
{C, T, Y, Z_1, ..., Z_{j-1}}. From eq. (15) and the minimal choice of
T_i in step 3, it must be the case that there is a path p from Z_k to X
or Y such that p is blocked by some $W \in \mathbf{W}$, where
$\mathbf{W}$ is the set added in eq. (15). From eq. (16), there exists a
path p' that has to be open after conditioning on W, and therefore there
exists a collider U such that U = W or W ∈ Desc(U). Let us consider two
possible scenarios for p', the first when it goes from Z_j to Y, and the
second when it goes from Z_j to X. In the former case, there is an open
path from W to Y, which contradicts eq. (17) given that
$\mathbf{W} \subseteq \mathbf{T}$. Then it must be the case that W only
blocks paths ending in X, so let us assume the case in which the end
node in p' is X. From (15), p is of the form Z_k - ... - W - ... - X,
where we are conditioning on all intermediate converging arrows and W
must be a chain or a common cause (i.e., → W → or ← W →). Split p into
p_1 : Z_k ... W, and p_2 : W ... X. From eq. (16), p' is such that W
opens a collider U, and then the path from Z_j to X. Split p' into
p'_1 : Z_j ... → U and p'_2 : U ← ... X. Now we have two possibilities.
If p_2 is such that W → ... X, we can concatenate
Z_k ... → U → W → ... X, which shows an open path from Z_k to X even
before conditioning on W, a contradiction.

If p_2 is such that W ← ... X, p_1 must be W → ... Z_k, and we have two
possibilities: (a) Z_k can be a descendant of W, and in this case the
collider in U is already open even without conditioning on W, a
contradiction; (b) W is connected to Z_k through some collider, for
instance, p_1 could be W → C ← ... Z_k, but similarly as before, given
that we condition on C, which is a descendant of W, and so of U, the
collider was already conditioned on and the path from Z_k to X was
already open, a contradiction. Therefore, it cannot be the case that
after adding T_k ⊆ NonDesc(X) to block paths from Z_k to X or Y, there
is a node Z_j such that k < j, and which previously had its paths to X
or Y blocked, that now has them open after conditioning on T_k. Thus, we
are allowed to modify each CI obtained in step 2 before Z_k in the
sequence by adding T_k, and then, based on the admissible sequence
starting from OR(X, Y | C, T, Z_1, ..., Z_n), we can reduce it through
these new augmented CIs of step 2 until reaching the desired expression
OR(X, Y | C).

Now we consider the complexity of the algorithm, and we show that it
runs in polynomial time. Notice that only step 3 of the algorithm could
imply some backtracking, i.e., when it chooses a (minimal) set T_i of
non-descendants of X that renders the equality in step 2 true. The
choice of separating set per se is polynomial; see footnote 5.
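The independence tests used by the procedure are d-separation queries on the graph, and each such query can be decided in polynomial time. The sketch below is not the sink-procedure itself; it is a standard moralization-based d-separation check, written under the assumption that the DAG is supplied as a hypothetical `parents` dictionary mapping each node to the set of its parents.

```python
def d_separated(parents, xs, ys, zs):
    """Return True if xs and ys are d-separated by zs in the DAG `parents`.

    Classical reduction: restrict to the ancestors of xs, ys and zs,
    moralize, delete zs, and test whether ys is reachable from xs.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)

    # 1. Ancestral closure of all nodes involved in the query.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moral graph: connect child and parents, and marry co-parents.
    neighbours = {n: set() for n in relevant}
    for child in relevant:
        pa = list(parents.get(child, ()))
        for i, p in enumerate(pa):
            neighbours[child].add(p)
            neighbours[p].add(child)
            for q in pa[i + 1:]:
                neighbours[p].add(q)
                neighbours[q].add(p)

    # 3. Graph search from xs that never enters the conditioning set.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in neighbours[node] if n not in zs and n not in seen)
    return not (seen & ys)


# Chain X -> Y -> S: Y separates X from S.
chain = {"X": set(), "Y": {"X"}, "S": {"Y"}}
chain_blocked = d_separated(chain, {"X"}, {"S"}, {"Y"})
chain_open = d_separated(chain, {"X"}, {"S"}, set())

# Collider X -> S <- Y: conditioning on S opens the path between X and Y.
collider = {"X": set(), "Y": set(), "S": {"X", "Y"}}
collider_marginal = d_separated(collider, {"X"}, {"Y"}, set())
collider_given_s = d_separated(collider, {"X"}, {"Y"}, {"S"})
```

Each of the three stages visits every node and edge a bounded number of times (plus the pairs of co-parents), so a single query is polynomial in the size of the graph.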

Consider that the choice of T_i implies failure in step 5 when it tests
the validity of $(T \perp\!\!\!\perp Y \mid X, C)$. Assume that there
exists a sequence Q of ancestors of S and not ancestors of X,
(Z_1, ..., Z_k, ..., Z_n), such that for each Z_i there is a separating
set T_i which makes the independence test valid. Let T = ∪ T_i, and
assume that $(T \perp\!\!\!\perp Y \mid X, C)$ holds. Assume now that in
round k, the sink-procedure chooses a different (minimal) separating set
than T_k, and call this new set T'_k, and subsequently
(T'_{k+1}, ..., T'_n). We have the new sequence Q' with additional
separators (T_1, ..., T_{k-1}, T'_k, ..., T'_n). Call T' = ∪ T'_i, and
Δ = T' \ (T ∩ T').

We have that $(T' \not\perp\!\!\!\perp Y \mid X, C)$ holds, or just
$(\Delta \not\perp\!\!\!\perp Y \mid X, C)$. (Otherwise
$(\Delta \perp\!\!\!\perp Y \mid X, C)$ would hold, which by composition
yields $(T' \perp\!\!\!\perp Y \mid X, C)$, a contradiction. See also
(Pearl and Paz, 2010).) Let δ ∈ Δ be the first node on which Q and Q'
disagree and which makes step 5 fail. δ blocks at least one path from
Z_k to X (after conditioning on
{C, Y, T, Z_1, ..., Z_{k-1}, T_i \ δ}) or from Z_k to Y (after
conditioning on {C, X, T, Z_1, ..., Z_{k-1}, T_i \ δ}); otherwise the
sequence will not be admissible (pass the test of step 2). By
construction, it must be the case that there is an open path from Z_k to
Y passing through δ (after conditioning on
{C, X, Q, Z_1, ..., Z_{k-1}, T'_k \ δ}).

Let p be the part of this path from δ to Y (or, δ - ... - Y). There
must exist in Q a vertex v which blocks this same path from Z_k to
{X, Y} or {Y} in the test of step 2. But v is in p or connected through
an open path p' to δ (i.e., p : δ - ... - v - ... - Y or
v - ... - p' - ... - δ - ... - p - ... - Y); otherwise we would not need
δ in the first place, contradicting minimality. In both cases, there is
an open path from v to Y, which contradicts the assumption about Q
validating $(T \perp\!\!\!\perp Y \mid X, C)$ as true, and therefore
such a δ cannot exist. Applying the same reasoning to the whole sequence
Q' inductively, we conclude that such a sequence cannot exist.
Therefore, step 5 does not imply any backtracking.

Similarly, let us consider the case when the choice of T_j implies
failure in a subsequent step 2. In the sequence Q', it is true that when
the algorithm chooses T_j to satisfy the admissibility of Z_j it blocks
some paths from Z_j to X. Now, assume that for Z_k, k < j, there is an
open path through T_j, i.e., Z_k ↔ U ↔ X, where U = T_j or
T_j ∈ Desc(U). But if we do not choose T_j (or any other node that
blocks this path), we would have an open path from Z_k to X through T_j,
a contradiction.

We now argue about the completeness of the procedure. Let us first
consider the case in which there is no X-independent variable in the
admissible sequence; we show that the sink-procedure will return an
admissible sequence whenever one exists. Notice that the sink-procedure
performs a search for an admissible sequence in reverse topological
order, and this only makes the conditional independence tests easier
than in any other order. This is so because in each step we are adding
all non-descendants of Z_k (which are non-colliders for Z_k), which
completely disconnects Z_k from X or Y except for paths passing through
non-descendants of X. (Also, non step-wise reductions can be converted
to step-wise ones through the graphoid axioms of decomposition and weak
union.)

Assume that there is a sequence (A_1, ..., A_m), called A, that does
not follow the order given by the sink-procedure and is admissible. Now,
let us call Q the sequence (Z_1, ..., Z_n) given by the sink-procedure,
and further assume that Q is not admissible. It is true that the last
element of both sequences is S, and in Q we would have the blocking set
{Z_1, ..., Z_{n-1}} while in A we would have {A_1, ..., A_{m-1}}. It is
true that {A_1, ..., A_{m-1}} ⊆ {Z_1, ..., Z_{n-1}}, and this is an
invariant along the algorithm for all nodes in A. Recall two facts:
(a) for now, we are assuming that there are no disagreements between
T_Q and T_A; (b) adding descendants of Z_k in each step can only open
some paths and spoil separation. For the sink-procedure to fail, it must
be the case that there exists Z_k ∈ Q such that
$(Z_k \perp\!\!\!\perp X \mid Y, C, Z_1, \ldots, Z_{k-1})$ and
$(Z_k \perp\!\!\!\perp Y \mid X, C, Z_1, \ldots, Z_{k-1})$ are both
false. Thus, there is at least one path from Z_k to X and one from Z_k
to Y that are not blocked by {Z_1, ..., Z_{k-1}} ∪ {C} (and
respectively, {Y} and {X}); call the sets of these paths P_1 and P_2,
respectively.

Assume that A also chooses Z_k at some point along its execution, and
Z_k is labeled there A_m. It must be the case that all paths from A_m to
X or all paths from A_m to Y are blocked by {A_1, ..., A_{m-1}} ∪ {C}
(and respectively, {Y} and {X}). But if
{A_1, ..., A_{m-1}} ⊆ {Z_1, ..., Z_{k-1}}, this is a contradiction. Now
assume that A does not choose Z_k along its execution. There are
ancestors of S which have to block P_1 from S to X or P_2 from S to Y,
and we consider without loss of generality the subset {A_1, ..., A_i}
that makes this separation hold. Consider A_j, the first descendant of
Z_k in G* that is in {A_1, ..., A_i}. If this node is S, we reach a
contradiction. Assume that A_j is not S but one of its ancestors. To
separate A_j from X or Y, we need to block the paths from it to X or Y,
but there are unblockable paths P_1 and P_2 passing through Z_k
(A_j ← ... - Z_k - P_1 - X or A_j ← ... - Z_k - P_2 - Y), and therefore
A_j cannot be part of an admissible sequence, a contradiction. Then, it
is the case that if both algorithms do not disagree in the choice of the
non-descendants of X, there is indeed no admissible sequence. For the
case when we add X-independent variables along the sequence, the result
also follows, based on the fact shown previously that there is no
backtracking in the choice of T_j, so that any algorithm that chooses
T_j consistently obtains the same outcome in terms of separation. Each
time that the sink-procedure does not return any sequence, we can
produce a counter-example for the G-recoverability of the triplet
(X, Y, C) based on the construction of Theorem 1. □

Theorem  4 

Let us first show the result for the binary case. To match the
dimensionality requirement, we assume that Z = Z_1 ∪ Z_2 and both Z_1
and Z_2 are binary, satisfying:

$$P(Z_1, Z_2 \mid X, Y, S) = P(Z_1, Z_2 \mid X, Y) \tag{18}$$
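Eq. (18) can be checked on a small example. The sketch below is illustrative only: it uses a single binary Z for brevity, a hypothetical joint table, and two hypothetical selection mechanisms, one driven by (X, Y) as the theorem requires and one driven by Z directly.

```python
from itertools import product


def z_given_xy(table):
    """Return P(z | x, y) from a table indexed by (x, y, z), any scale."""
    totals = {}
    for (x, y, z), w in table.items():
        totals[(x, y)] = totals.get((x, y), 0.0) + w
    return {key: w / totals[key[:2]] for key, w in table.items()}


states = list(product((0, 1), repeat=3))                 # (x, y, z)
weights = (0.05, 0.10, 0.20, 0.05, 0.15, 0.05, 0.10, 0.30)
joint = dict(zip(states, weights))                        # hypothetical P(x, y, z)

# Selection driven by X and Y only: P(S=1 | x, y) = 0.2 + 0.3x + 0.4y.
selected_by_xy = {(x, y, z): joint[(x, y, z)] * (0.2 + 0.3 * x + 0.4 * y)
                  for x, y, z in states}
# Selection driven by Z directly: P(S=1 | z) = 0.2 + 0.6z.
selected_by_z = {(x, y, z): joint[(x, y, z)] * (0.2 + 0.6 * z)
                 for x, y, z in states}

target = z_given_xy(joint)
under_xy = z_given_xy(selected_by_xy)
under_z = z_given_xy(selected_by_z)

invariant_when_s_depends_on_xy = all(abs(under_xy[k] - target[k]) < 1e-12 for k in states)
invariant_when_s_depends_on_z = all(abs(under_z[k] - target[k]) < 1e-12 for k in states)
```

In the first case the selection factor cancels within each (x, y) stratum, so P(z | x, y) is unchanged; in the second it does not, which is why condition (i) of Theorem 4 is needed.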


To  simplify  the  notation,  let  us  write: 

•  P(X  =  x,Y  =  y  \  Zi  =  zi,Z2  =  z2)  =  axytZlZ2 

•  P(Z  1  =  Zi,  Z2  =  z2)  =  (3ZlZ2 

•  P(Zi  =  zi,  Z2  =  z2  |  X  =  x,  Y  =  y)  =  7 ZlZ2,xy 


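Note that the parameters $\gamma_{z_1z_2,xy}$ and $\beta_{z_1z_2}$ impose
constraints on the distribution $\alpha_{xy,z_1z_2}$, which can be
made explicit by the following equation,

$$\gamma_{z_1z_2,xy} = \frac{\alpha_{xy,z_1z_2}\,\beta_{z_1z_2}}{\sum_{z_1',z_2'} \alpha_{xy,z_1'z_2'}\,\beta_{z_1'z_2'}} \tag{19}$$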
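
As a worked step, and treating $\gamma_{z_1z_2,xy}$ and $\beta_{z_1z_2}$ as known quantities, eq. (19) can be cleared of its denominator. The rearrangement below is a sketch of why each such constraint is linear in the unknowns $\alpha_{xy,z_1z_2}$; it introduces no symbols beyond those defined above.

```latex
\begin{align*}
\gamma_{z_1z_2,xy} \sum_{z_1',z_2'} \alpha_{xy,z_1'z_2'}\,\beta_{z_1'z_2'}
  - \alpha_{xy,z_1z_2}\,\beta_{z_1z_2} &= 0 \\
\Longleftrightarrow\quad
(\gamma_{z_1z_2,xy} - 1)\,\beta_{z_1z_2}\,\alpha_{xy,z_1z_2}
  + \gamma_{z_1z_2,xy} \sum_{(z_1',z_2') \neq (z_1,z_2)} \beta_{z_1'z_2'}\,\alpha_{xy,z_1'z_2'} &= 0
\end{align*}
```

For a fixed pair $(x, y)$ this is a homogeneous linear equation in the four unknowns $\alpha_{xy,00}, \alpha_{xy,01}, \alpha_{xy,10}, \alpha_{xy,11}$, with coefficients built only from $\gamma$ and $\beta$.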
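$$
M = \left(\begin{array}{cccc|cccc|cccc}
(c_1-1)b_1 & c_1b_2 & c_1b_3 & c_1b_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
c_2b_1 & (c_2-1)b_2 & c_2b_3 & c_2b_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
c_3b_1 & c_3b_2 & (c_3-1)b_3 & c_3b_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & (c_4-1)b_1 & c_4b_2 & c_4b_3 & c_4b_4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & c_5b_1 & (c_5-1)b_2 & c_5b_3 & c_5b_4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & c_6b_1 & c_6b_2 & (c_6-1)b_3 & c_6b_4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & (c_7-1)b_1 & c_7b_2 & c_7b_3 & c_7b_4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & c_8b_1 & (c_8-1)b_2 & c_8b_3 & c_8b_4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & c_9b_1 & c_9b_2 & (c_9-1)b_3 & c_9b_4 \\
\hline
(1-c_{10})b_1 & -c_{10}b_2 & -c_{10}b_3 & -c_{10}b_4 & (1-c_{10})b_1 & -c_{10}b_2 & -c_{10}b_3 & -c_{10}b_4 & (1-c_{10})b_1 & -c_{10}b_2 & -c_{10}b_3 & -c_{10}b_4 \\
-c_{11}b_1 & (1-c_{11})b_2 & -c_{11}b_3 & -c_{11}b_4 & -c_{11}b_1 & (1-c_{11})b_2 & -c_{11}b_3 & -c_{11}b_4 & -c_{11}b_1 & (1-c_{11})b_2 & -c_{11}b_3 & -c_{11}b_4 \\
-c_{12}b_1 & -c_{12}b_2 & (1-c_{12})b_3 & -c_{12}b_4 & -c_{12}b_1 & -c_{12}b_2 & (1-c_{12})b_3 & -c_{12}b_4 & -c_{12}b_1 & -c_{12}b_2 & (1-c_{12})b_3 & -c_{12}b_4
\end{array}\right)
$$
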
Now, for a given assignment $\langle X = 0, Y = 0 \rangle$, let us
list all independent parameters $\gamma_{z_1z_2,00}$,

$$
\begin{aligned}
\gamma_{00,00} &= \frac{\alpha_{00,00}\,\beta_{00}}{\sum_{z_1',z_2'} \alpha_{00,z_1'z_2'}\,\beta_{z_1'z_2'}} \\
\gamma_{01,00} &= \frac{\alpha_{00,01}\,\beta_{01}}{\sum_{z_1',z_2'} \alpha_{00,z_1'z_2'}\,\beta_{z_1'z_2'}} \\
\gamma_{10,00} &= \frac{\alpha_{00,10}\,\beta_{10}}{\sum_{z_1',z_2'} \alpha_{00,z_1'z_2'}\,\beta_{z_1'z_2'}}
\end{aligned} \tag{20}
$$

Note that $\gamma_{11,00}$ is not an independent parameter because
it is completely determined by the other three
equations in (20) given the integrality constraint (the
$\gamma_{z_1z_2,00}$ sum to one). For
now, we have 3 equations and 4 unknown variables
($\{\alpha_{00,00}, \alpha_{00,01}, \alpha_{00,10}, \alpha_{00,11}\}$).

Similarly, we write the constraints for the assignments
$\langle X = 1, Y = 0 \rangle$ and $\langle X = 0, Y = 1 \rangle$, respectively,

$$\gamma_{00,10} = \frac{\alpha_{10,00}\,\beta_{00}}{\sum_{z_1',z_2'} \alpha_{10,z_1'z_2'}\,\beta_{z_1'z_2'}}, \quad \ldots \tag{21}$$

$$\gamma_{00,01} = \frac{\alpha_{01,00}\,\beta_{00}}{\sum_{z_1',z_2'} \alpha_{01,z_1'z_2'}\,\beta_{z_1'z_2'}}, \quad \ldots \tag{22}$$

Now, we can write the equations for the constraints
relative to the variables $\alpha_{11,z_1z_2}$ as a function of the
previous variables $\{\alpha_{00,z_1z_2}, \alpha_{01,z_1z_2}, \alpha_{10,z_1z_2}\}$,

$$\gamma_{z_1z_2,11} = \frac{(1 - \alpha_{00,z_1z_2} - \alpha_{10,z_1z_2} - \alpha_{01,z_1z_2})\,\beta_{z_1z_2}}{\sum_{z_1',z_2'} (1 - \alpha_{00,z_1'z_2'} - \alpha_{10,z_1'z_2'} - \alpha_{01,z_1'z_2'})\,\beta_{z_1'z_2'}}, \quad (z_1, z_2) \in \{00, 01, 10\} \tag{23}$$

Notice that the parameters $\gamma_{z_1z_2,11}$ are independent,
and we have 12 equations and 12 unknowns, but it
remains to show that the equations are all independent
(notice that the last three constraints, in eq. (23),
involve variables of the other constraints). Another fact
to observe is that the system is indeed linear. We show
that the matrix $M$ induced by eqs. (20, 21, 22, 23)
is (almost surely) invertible, and so yields
a unique solution. $M$ is invertible if and only if its
determinant is non-zero. For convenience, let us display
the variables $\alpha_{xy,z_1z_2}$ column-wise, renaming $\beta_{z_1z_2}$ as
constants $b_1, \dots, b_4$, and $\gamma_{z_1z_2,xy}$ as constants $c_1, \dots, c_{12}$.
The matrix $M$ is displayed above, after eq. (19).

In what follows, we exploit the block structure of $M$
and apply the following transformations to better
visualize its determinant.

1.  First note that columns {1, 5, 9} are all multiplied
by $b_1$, which can be factored out by the determinant
property. Similarly for the other columns
with respect to $\{b_2, b_3, b_4\}$, which can be expressed
as $\det(M) = (b_1 b_2 b_3 b_4)^3 \det(M^{(1)})$, where $M^{(1)}$ is
the resultant matrix.

2.  Let us sum lines {1, 4, 7} to line 10, lines {2, 5, 8}
to line 11, and lines {3, 6, 9} to line 12, which generates
matrix $M^{(2)}$.

3.  We now sum the columns of $M^{(2)}$: $-1$ times column
4 to column 1, $-1$ times column 4 to column
2, and $-1$ times column 4 to column 3 (similarly
for the other blocks), which yields $M^{(3)}$.

4.  Sum the columns of $M^{(3)}$: $c_1$ times column 1, $c_2$
times column 2 and $c_3$ times column 3 to column
4 (similarly for the other blocks), yielding $M^{(4)}$.

5.  Now, reorder the columns, "pushing" columns 4
and 8 towards the end; call the resultant matrix
$M^{(5)}$.


Now we are done: notice that $\det(M) =
(b_1 b_2 b_3 b_4)^3 \det(M^{(5)})$ (up to the sign introduced by
the column reordering), and the determinant of $M^{(5)}$ is
the product of the determinants of two blocks, the square
matrix $M_1^{(5)}$ from lines 1-9 and the square
matrix $M_2^{(5)}$ from lines 10-12. Note that $\det(M_1^{(5)}) = -1$,
and it remains to show that $\det(M_2^{(5)})$ is almost
always different from zero. The parameters $c_1$ to $c_{12}$ are
independent, and given the form obtained for $M_2^{(5)}$,
where all entries are independent, this implies that
$M_2^{(5)}$ is non-singular almost surely, and so is $M$;
coincidental cancellations occur only on a set of Lebesgue
measure zero.

Therefore, we consider $M$ as full rank, so the system can be
solved algebraically with standard techniques, yielding
the solution $\alpha = M^{-1}\gamma$. This result, together with
P(Z), yields the joint distribution P(Y, X, Z). The case
for non-binary variables follows in a straightforward
way, just noticing the requirement for agreement
between the dimensions of the IV set Z and {X, Y}. □
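
The construction in this proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: it draws a synthetic $\alpha$ and $\beta$, computes $\gamma$ through eq. (19), assembles $M$ with blocks ordered $xy = 00, 10, 01$, and solves for $\alpha$. The function names (`posterior_gamma`, `build_system`, `recover_alpha`) are hypothetical, and the right-hand side vector is our own rearrangement of the constraints (it assumes the $b_i$ sum to one), since the proof writes the solution only as $\alpha = M^{-1}\gamma$. The last lines compare $\det(M)$ with the reduced $3 \times 3$ determinant obtained by following steps 1-5, with the sign worked out for the column order that places columns 4, 8, 12 last.

```python
import numpy as np

def posterior_gamma(alpha, beta):
    """Eq. (19): gamma[xy, z] = alpha[xy, z] * beta[z] / sum_z' alpha[xy, z'] * beta[z']."""
    joint = alpha * beta                          # P(x, y, z), shape (4, 4)
    return joint / joint.sum(axis=1, keepdims=True)

def build_system(gamma, beta):
    """Assemble the 12x12 matrix M and a right-hand side (the rhs is our own derivation)."""
    M = np.zeros((12, 12))
    rhs = np.zeros(12)
    for k in range(3):                            # column blocks: xy = 00, 10, 01
        for i in range(3):                        # rows within a block: z = 00, 01, 10
            r = 3 * k + i
            M[r, 4 * k:4 * k + 4] = gamma[k, i] * beta
            M[r, 4 * k + i] -= beta[i]            # diagonal entry becomes (c - 1) * b
    for i in range(3):                            # rows 10-12 encode the xy = 11 constraints
        r = 9 + i
        M[r, :] = -gamma[3, i] * np.tile(beta, 3)
        M[r, i::4] += beta[i]                     # entries of the form (1 - c) * b
        rhs[r] = beta[i] - gamma[3, i]            # assumes beta sums to one
    return M, rhs

def recover_alpha(gamma, beta):
    """Solve the linear system and fill in alpha for xy = 11 by normalisation."""
    M, rhs = build_system(gamma, beta)
    a = np.linalg.solve(M, rhs).reshape(3, 4)     # rows: xy = 00, 10, 01
    return np.vstack([a, 1.0 - a.sum(axis=0)])

# Synthetic ground truth (illustrative data only, not taken from the paper).
rng = np.random.default_rng(0)
alpha_true = rng.dirichlet(np.ones(4), size=4).T  # alpha_true[xy, z]; each column sums to 1
beta = rng.dirichlet(np.ones(4))                  # P(Z1, Z2)
gamma = posterior_gamma(alpha_true, beta)

alpha_hat = recover_alpha(gamma, beta)
print("max abs recovery error:", np.abs(alpha_hat - alpha_true).max())

# Steps 1-5: det(M) should match (b1 b2 b3 b4)^3 times a 3x3 determinant in the c's.
M, _ = build_system(gamma, beta)
M2 = gamma[:3, :3].T - gamma[3, :3][:, None]      # entry (i, k) = c_{3k+i} - c_{9+i}
print("det(M):", np.linalg.det(M))
print("reduced:", np.prod(beta) ** 3 * np.linalg.det(M2))
```

In this sketch $\gamma$ is computed from the synthetic joint distribution; in the setting of the theorem it would instead be estimated from the selection-biased data, which eq. (18) licenses, while $\beta = P(Z)$ is taken as given.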
Corollary 5

First, apply Theorem 4 to the variables {W, Y}, replacing
X with W, and obtain P(W, Y). Further note that
P(X | Y, W, S = 1) = P(X | Y, W), which together
with the first observation finishes this part of the proof.
The proof for when we do not rely on Z is essentially
the same. □
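
The way the two observations combine can be written out with the chain rule. This is a sketch under the assumption that the target of this part of the corollary is the joint distribution P(X, Y, W); lowercase letters denote values of the corresponding variables.

```latex
\begin{align*}
P(x, y, w) &= P(x \mid y, w)\, P(w, y) \\
           &= P(x \mid y, w, S = 1)\, P(w, y)
\end{align*}
```

The first factor on the last line is estimable from the selection-biased data, and the second is the quantity obtained from Theorem 4.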